Abstract



We propose a new online learning model for learning with preference feedback. 
The model is especially suited for applications like web search and recommender 
systems, where preference data is readily available from implicit user feedback 
(e.g. clicks). In particular, at each time step a potentially structured object (e.g. a 
ranking) is presented to the user in response to a context (e.g. query), providing 
him or her with some unobserved amount of utility. As feedback the algorithm 
receives an improved object that would have provided higher utility. We propose 
a learning algorithm with provable regret bounds for this online learning setting 
and demonstrate its effectiveness on a web-search application. The new learning 
model also applies to many other interactive learning problems and admits several 
interesting extensions. 



1 Introduction 

Our new learning model is motivated by how users interact with a web-search engine or a
recommender system. At each time step, the user issues a query and the system responds by
supplying a list of results. The user views some of the results and selects those that he or she
prefers. Here are two such examples:

Web Search: In response to a query, the search engine presents the ranking [A, B, C, D, E, ...] and 
observes that the user clicks on documents C and D. 

Movie Recommendation: An online service recommends movie A to a user. However, the user 
ignores the recommendation and instead rents another movie B after some browsing. 



In both cases the user feedback comes in the form of a preference. In the web search example, 
we can infer that the user would have preferred the ranking [C, D, A, B, E, ...] over the one we 
presented [6]. In the recommendation example, movie B was preferred over movie A. The cardinal 
utilities of the predictions, however, are never observed, and the algorithm typically does not get the 
optimal ranking/movie as feedback. 

This preference feedback is different from conventional online learning models. In the simplest 
form of the multi-armed bandit problem 12] HI E], an algorithm chooses an action (out of K possible 
actions) and observes reward only for that action. Conversely, rewards of all possible actions are 
revealed in the case of learning with expert advice [3]. Our model, where the ordering of two arms 
is revealed (the one we presented and the one we receive as feedback), sits between the expert and 
the bandit setting. A similar relationship holds for online convex optimization [9] and online convex 
optimization in the bandit setting [5], which can be viewed as continuous extensions of the expert 
and the bandit problems respectively, since they rely on observing either a full convex function or 
the value of a convex function after each iteration. Most closely related to our work is the dueling 
bandits setting [7][8], but existing algorithms are known to converge rather slowly. 






In the following, we formally define the online preference learning model and a notion of regret,
propose a simple algorithm for which we prove a regret bound, and empirically evaluate the
algorithm on a web-search problem.



2 Online Preference Learning Model 

The online preference learning model is defined as follows. At each round $t$, the learning algorithm 
receives a context $x_t \in \mathcal{X}$ and presents a (possibly structured) object $y_t \in \mathcal{Y}$. In response, the user 
returns an object $\bar{y}_t \in \mathcal{Y}$, which the algorithm receives as feedback. For example, in web search, 
a user issues a query and is presented with a ranked list of URLs ($y_t$). The user interacts with the 
ranking that was provided to her by clicking on results that are relevant to her. This user interaction 
allows us to infer a better ranking $\bar{y}_t$ for this user.

We assume that the user evaluates rankings according to a utility function $U(x, y)$ that is unknown 
to the learning algorithm. A natural way to define regret in this model is based on the difference 
in utility $U(x_t, y_t^*) - U(x_t, y_t)$ between the object $y_t$ we present and the best possible object 
$y_t^* = \arg\max_{y \in \mathcal{Y}} U(x_t, y)$ that could have been presented. The goal of an algorithm is to minimize 

$$\mathrm{REGRET}_T := \frac{1}{T} \sum_{t=1}^{T} \left( U(x_t, y_t^*) - U(x_t, y_t) \right). \tag{1}$$

To prove bounds on the regret, we specify the properties of the user's preference feedback more 
precisely. We say that user feedback is $\alpha$-informative if, for some $\alpha \in (0, 1]$ and $\xi_t$, 

$$U(x_t, \bar{y}_t) - U(x_t, y_t) = \alpha \left( U(x_t, y_t^*) - U(x_t, y_t) \right) - \xi_t. \tag{2}$$

Intuitively, the above definition describes the quality of feedback by how much the utility of the user 
feedback $\bar{y}_t$ is higher than that of the algorithm's prediction $y_t$ in terms of an (unknown) fraction $\alpha$ 
of the maximum possible utility range. Note that $\xi_t$ is a slack variable that captures noise in the 
feedback. 
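To make the definition concrete, the following minimal sketch solves Eq. (2) for the slack given the three utilities involved in one round. The function name `slack` and the numeric values are illustrative assumptions, not part of the original model description.

```python
def slack(alpha, u_opt, u_feedback, u_presented):
    """Slack xi_t implied by Eq. (2) for one round.

    alpha       -- informativeness parameter in (0, 1]
    u_opt       -- U(x_t, y_t^*), utility of the best possible object
    u_feedback  -- U(x_t, ybar_t), utility of the object returned by the user
    u_presented -- U(x_t, y_t), utility of the object the algorithm presented
    """
    # Rearranging Eq. (2): xi_t = alpha * (U* - U) - (U_bar - U).
    return alpha * (u_opt - u_presented) - (u_feedback - u_presented)


# Hypothetical utilities: the feedback closes half of the gap to the optimum.
xi_half = slack(0.5, u_opt=3.0, u_feedback=2.0, u_presented=1.0)
xi_full = slack(1.0, u_opt=3.0, u_feedback=2.0, u_presented=1.0)
```

With these illustrative numbers the feedback needs no slack at $\alpha = 0.5$, but needs a slack of one utility unit to be described with $\alpha = 1$.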

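In the following, we use a linear model for the utility function 

$$U(x, y) = w^{*\top} \phi(x, y), \tag{3}$$

where $w^* \in \mathbb{R}^N$ is an unknown parameter vector and $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^N$ is a joint feature map such 
that $\|\phi(x, y)\| \le R$ for any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. 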



3 Algorithm 

We propose the algorithm in Figure 1 for the online preference learning problem. It maintains 
a vector $w_t$ and predicts the object with the highest utility according to $w_t$ in each iteration $t$. 
It then receives feedback $\bar{y}_t$ and updates $w_t$ in the direction $\phi(x_t, \bar{y}_t) - \phi(x_t, y_t)$. 

Initialize $w_1 \leftarrow 0$ 
for $t = 1$ to $T$ do 
  Observe $x_t$ 
  Present $y_t \leftarrow \arg\max_{y \in \mathcal{Y}} w_t^\top \phi(x_t, y)$ 
  Obtain feedback $\bar{y}_t$ 
  Update: $w_{t+1} \leftarrow w_t + \phi(x_t, \bar{y}_t) - \phi(x_t, y_t)$ 
end for 

Figure 1: Preference Perceptron. 

Theorem 1 Under $\alpha$-informative feedback, the algorithm in Figure 1 has regret 

$$\mathrm{REGRET}_T \le \frac{1}{\alpha T} \sum_{t=1}^{T} \xi_t + \frac{2R \|w^*\|}{\alpha \sqrt{T}}. \tag{4}$$

The proof of the above theorem is provided in Appendix A. When the user feedback is noise free, 
the first term on the right-hand side of the above bound vanishes. The average regret in this case 
approaches zero at the rate $1/\sqrt{T}$. In addition to this result, we have the following extensions, which 
we cannot provide here due to space limitations: 

• It is possible to further weaken the requirement on the feedback. Instead of requiring 
$\alpha$-informative feedback in every round, it suffices that the user gives $\alpha$-informative feedback in expectation. 
We can show a result similar to that in Theorem 1 in this case. 
• It is also possible to show that an algorithm different from Algorithm 1 can 
minimize any convex loss (under mild assumptions) defined on the utility difference 
$w^{*\top} (\phi(x_t, y_t) - \phi(x_t, y_t^*))$. 
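The Preference Perceptron of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: it assumes a finite candidate set that can be enumerated, a zero initial weight vector (as the inductive step of the proof in the appendix requires), and user-supplied callables `candidates`, `phi`, and `feedback`, all of which are hypothetical names.

```python
from typing import Any, Callable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def preference_perceptron(
    contexts: Sequence[Any],
    candidates: Callable[[Any], Sequence[Any]],
    phi: Callable[[Any, Any], Sequence[float]],
    feedback: Callable[[Any, Any], Any],
    dim: int,
) -> Tuple[List[float], List[Any]]:
    """Run the update loop of Figure 1 and return (final weights, presented objects)."""
    w = [0.0] * dim  # assumed initialization w_1 = 0
    presented = []
    for x in contexts:
        # Present the candidate with the highest predicted utility w_t^T phi(x_t, y).
        y = max(candidates(x), key=lambda c: dot(w, phi(x, c)))
        # The user returns an (ideally better) object ybar_t.
        y_bar = feedback(x, y)
        # Update: w_{t+1} = w_t + phi(x_t, ybar_t) - phi(x_t, y_t).
        w = [wi + a - b for wi, a, b in zip(w, phi(x, y_bar), phi(x, y))]
        presented.append(y)
    return w, presented


# Toy usage: two items with one-hot features and a user who always prefers "B".
items = ["A", "B"]
one_hot = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
w, shown = preference_perceptron(
    contexts=[None, None, None],
    candidates=lambda x: items,
    phi=lambda x, y: one_hot[y],
    feedback=lambda x, y: "B",
    dim=2,
)
print(w, shown)  # prints [-1.0, 1.0] ['A', 'B', 'B']
```

In the toy run the first presentation is arbitrary because all scores tie at zero; after a single update the algorithm presents the preferred item and the weights stop changing.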



4 Experiments 
Figure 2: Average regret versus time based on noise free $\alpha$-informative feedback. 

We applied our Preference Perceptron algorithm to the Yahoo! learning to rank dataset [4]. This 
dataset consists of query-URL features (denoted as $x_i^q$ for query $q$ and URL $i$ for that particular query) 
with a relevance rating which ranges from zero (irrelevant) to four (perfectly relevant). We first 
computed the best least squares fit $w^*$ to the relevance labels from the features using the entire dataset; 
all the utilities in our experiment are reported with respect to this $w^*$. 

To pose ranking as a structured prediction problem, we defined our joint feature map as follows: 

$$w^\top \phi(q, y) = \sum_{i=1}^{5} \frac{w^\top x^q_{y_i}}{\log(i+1)}. \tag{5}$$

In the above equation, $y$ denotes a ranking. In particular, $y_i$ is the index of the URL which is placed 
at position $i$ in the ranking. Thus, the above measure considers the top five URLs for a query $q$ and 
computes a score based on a graded relevance. The above feature map and utility are motivated by 
the definition $\mathrm{DCG@5}(q, y) = \sum_{i=1}^{5} \frac{r^q_{y_i}}{\log(i+1)}$. Effectively, our utility score (5) mimics DCG@5 
by replacing the relevance label with a linear prediction based on the features. 

For query $q_t$ at time step $t$, the Preference Perceptron algorithm presents the ranking $y_t$ that 
maximizes $w_t^\top \phi(q_t, y)$. Note that this merely amounts to sorting documents by the scores $w_t^\top x_i^{q_t}$, which 
can be done very efficiently. Once a ranking ($y_t$) was presented to a user, the user returns a ranking 
$\bar{y}_t$. The exact nature of user feedback differed in the two experiments; the details of feedback can 
be found below. Query ordering was randomly permuted twenty times and all the results reported 
are an average over the runs. 
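The ranking step can be sketched as follows. The code assumes the joint feature map is the position-discounted sum of the top five document feature vectors, as in the DCG@5-style utility described above; the function names and the use of the natural logarithm are assumptions (a different log base only rescales the utility and does not change the maximizer).

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_score(w: Sequence[float], docs: Sequence[Sequence[float]]) -> List[int]:
    """Document indices sorted by decreasing linear score w^T x_i."""
    return sorted(range(len(docs)), key=lambda i: -dot(w, docs[i]))


def joint_feature_map(
    docs: Sequence[Sequence[float]], ranking: Sequence[int], k: int = 5
) -> List[float]:
    """phi(q, y): sum over the top-k positions of x_{y_i} / log(i + 1)."""
    phi = [0.0] * len(docs[0])
    for pos, doc_idx in enumerate(ranking[:k], start=1):
        discount = 1.0 / math.log(pos + 1)
        for j, value in enumerate(docs[doc_idx]):
            phi[j] += discount * value
    return phi


# Hypothetical query with three documents and one-dimensional features.
docs = [[1.0], [3.0], [2.0]]
w_t = [1.0]
y_t = rank_by_score(w_t, docs)           # [1, 2, 0]
best_utility = dot(w_t, joint_feature_map(docs, y_t))
```

Sorting is sufficient because the discounts $1/\log(i+1)$ are positive and decreasing in the position $i$, so placing higher-scoring documents earlier can never lower the total.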

The utility regret in Eqn. (1), based on the definition of utility in (5), is given by 
$\frac{1}{T} \sum_{t=1}^{T} \left( w^{*\top} \phi(q_t, y_t^*) - w^{*\top} \phi(q_t, y_t) \right)$. Here $y_t^*$ denotes the optimal ranking with respect to $w^*$. 
We also present our results on another quantity which we refer to as the DCG* regret. Since for 
every query-URL pair there is a manual relevance judgment in the dataset, optimal DCG can be 
computed by sorting the relevance scores. In DCG* regret, we measure the difference between the 
DCG of the optimal ranking and that of the rankings we present in each step. 
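Both regret measures are running averages of a per-round gap between the optimal and the presented ranking. The sketch below shows the DCG-based variant on made-up relevance labels; `dcg_at_k` and `average_regret` are illustrative names, and the natural logarithm is an assumption since the log base is not specified.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], ranking: Sequence[int], k: int = 5) -> float:
    """DCG@k of a ranking: sum over top-k positions of r_{y_i} / log(i + 1)."""
    return sum(
        relevances[doc] / math.log(pos + 1)
        for pos, doc in enumerate(ranking[:k], start=1)
    )


def average_regret(optimal: Sequence[float], presented: Sequence[float]) -> float:
    """Average per-round gap between optimal and presented values."""
    return sum(o - p for o, p in zip(optimal, presented)) / len(optimal)


# Hypothetical query: three URLs with labels on the 0-4 scale.
labels = [0.0, 4.0, 2.0]
optimal_ranking = sorted(range(len(labels)), key=lambda i: -labels[i])  # [1, 2, 0]
presented_ranking = [0, 1, 2]
gap = dcg_at_k(labels, optimal_ranking) - dcg_at_k(labels, presented_ranking)
```

The utility regret is obtained the same way by replacing each label with the linear prediction $w^{*\top} x_i^q$.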

$\alpha$-informative feedback The goal of the first experiment was to see how the regret of the 
algorithm changes with $\alpha$, assuming $\alpha$-informative feedback without noise. Once a ranking was 
presented, the feedback was obtained as follows: given a ranked list, the simulated user would go down 
the list and would stop when she found five URLs such that, when they are placed at the top of 
the list (in the order of their utilities), they gave noise free $\alpha$-informative feedback (i.e. $\xi_t = 0$) based 
on $w^*$. Figure 2 shows the results for this experiment for two different $\alpha$ values. As expected, the 
regret with $\alpha = 1.0$ is lower than the regret with $\alpha = 0.1$. Note, however, that 
the difference between the two curves is much smaller than a factor of ten. This is because strictly 
$\alpha$-informative feedback is also strictly $\beta$-informative feedback for any $\beta \le \alpha$. So, there could be 
several instances where user feedback was much stronger than what was required. Since the slack 
variables are zero, the average utility regret approaches zero as expected. 




Figure 3: Regret versus time based on actual relevance labels. 



Relevance label feedback In this experiment, feedback was based on the actual relevance labels 
in the dataset as follows: given a ranking for a query, the user would go down the list inspecting the 
top 25 URLs (or all the URLs if the list is shorter). The five URLs with the highest relevance labels ($r_i^q$) 
are placed at the top five locations in the user feedback. Note that this is a noisy version of feedback 
since the linear fit cannot describe the labels exactly in this dataset. 
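The simulated user for this experiment can be sketched as below. Two details are assumptions because the description leaves them open: the promoted URLs are ordered by decreasing label with ties kept in presented order, and the remaining URLs keep their presented order. The function and parameter names are illustrative.

```python
from typing import List, Sequence


def relevance_label_feedback(
    ranking: Sequence[int],
    relevances: Sequence[float],
    inspect: int = 25,
    top: int = 5,
) -> List[int]:
    """Build the feedback ranking from the labels of the inspected prefix."""
    # The user only looks at the first `inspect` URLs (or all, if fewer).
    inspected = list(ranking[:inspect])
    # Stable sort by decreasing label; ties keep the presented order (assumption).
    promoted = sorted(inspected, key=lambda doc: -relevances[doc])[:top]
    promoted_set = set(promoted)
    # All other URLs follow in their presented order (assumption).
    rest = [doc for doc in ranking if doc not in promoted_set]
    return promoted + rest


# Hypothetical example with seven URLs, promoting the best three.
presented = [0, 1, 2, 3, 4, 5, 6]
labels = [0, 3, 1, 4, 0, 2, 2]
fb = relevance_label_feedback(presented, labels, inspect=25, top=3)
print(fb)  # prints [3, 1, 5, 0, 2, 4, 6]
```

Restricting `inspect` models a user who never sees relevant URLs buried deep in the list, which is one reason this feedback is weaker than returning the optimal ranking.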

As a baseline, a ranking SVM was trained repeatedly. In the first iteration, a random ranking was 
presented and the feedback ranking (as described in the paragraph above) was obtained. An SVM 
was trained based on the pair of examples $((q_1, y_1), (q_1, \bar{y}_1))$. From then on, a ranking was 
presented based on the prediction from the previously trained ranking SVM. The user always returned 
a ranking based on the relevance labels as described above; the pairs of examples were stored after 
every iteration. Note that training a ranking SVM after each iteration would be prohibitive since it 
involves cross-validating a parameter $C$ that trades off between the margin and the slacks. Thus, 
we retrained the SVM whenever 10% more examples had been added to the training set since the previous 
training. The value of the parameter $C$ was obtained via five-fold cross-validation. Once a $C$ 
value was determined, the SVM was trained on all the training examples available at that time and was 
used to predict rankings until the next training. 

Results of this experiment are presented in Figure 3. We have provided both the mean regret and 
one standard deviation for this experiment. Since the feedback is now based on relevance labels 
(and not on a linear fit), the utility regret converges to a non-zero value. It can also be seen that 
our Preference Perceptron performs significantly better than the SVM. It might be possible 
to improve the performance of the SVM by training it more often. However, this would be prohibitively 
expensive. For instance, the perceptron algorithm took around 30 minutes to run (most of which was 
inefficient Python I/O), whereas the SVM version took about 20 hours (on the same machine). 

5 Conclusions 

We proposed a new model of online learning with preferences that is especially suitable for implicit 
user feedback. An efficient algorithm was proposed that provably minimizes regret. Experiments 
showed its effectiveness for web-search ranking. 
A Proof of Theorem 1 

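Proof First, consider the inner product of $w_{T+1}$ with itself. We have 

$$\begin{aligned}
w_{T+1}^\top w_{T+1} &= w_T^\top w_T + 2 w_T^\top \left( \phi(x_T, \bar{y}_T) - \phi(x_T, y_T) \right) \\
&\quad + \left( \phi(x_T, \bar{y}_T) - \phi(x_T, y_T) \right)^\top \left( \phi(x_T, \bar{y}_T) - \phi(x_T, y_T) \right) \\
&\le w_T^\top w_T + 4R^2 \\
&\le 4R^2 T.
\end{aligned}$$

On the first line, we simply used our update rule from Algorithm 1. On the second line, we used 
the fact that $w_T^\top \left( \phi(x_T, \bar{y}_T) - \phi(x_T, y_T) \right) \le 0$ from the choice of $y_T$ in Algorithm 1 and that 
$\|\phi(x_T, \bar{y}_T) - \phi(x_T, y_T)\|^2 \le 4R^2$. We obtain the last line inductively. 

Further, from the update rule in Algorithm 1 we have 

$$\begin{aligned}
w_{T+1}^\top w^* &= w_T^\top w^* + w^{*\top} \left( \phi(x_T, \bar{y}_T) - \phi(x_T, y_T) \right) \\
&= \sum_{t=1}^{T} w^{*\top} \left( \phi(x_t, \bar{y}_t) - \phi(x_t, y_t) \right) \\
&= \sum_{t=1}^{T} \left( U(x_t, \bar{y}_t) - U(x_t, y_t) \right).
\end{aligned}$$

We now use the fact that $w_{T+1}^\top w^* \le \|w^*\| \|w_{T+1}\|$ (Cauchy-Schwarz inequality), which implies 

$$\sum_{t=1}^{T} \left( U(x_t, \bar{y}_t) - U(x_t, y_t) \right) \le 2R\sqrt{T} \|w^*\|.$$

The above inequality, along with the $\alpha$-informative feedback (Eqn. (2)), gives 

$$\alpha \sum_{t=1}^{T} \left( U(x_t, y_t^*) - U(x_t, y_t) \right) - \sum_{t=1}^{T} \xi_t \le 2R\sqrt{T} \|w^*\|,$$

from which the claimed result follows after dividing by $\alpha T$ and rearranging. ■ 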